For this assignment, we will be working with part of a Kaggle data set that was the subject of a machine learning competition and is often used for practicing ML models. The goal is classification; specifically, to predict which passengers would survive the Titanic shipwreck.
Fig. 1: RMS Titanic departing Southampton on April 10, 1912.
Load the data from data/titanic.csv into R and
familiarize yourself with the variables it contains using the codebook
(data/titanic_codebook.txt).
Notice that survived and pclass should be
changed to factors. When changing survived to a factor, you
may want to reorder the factor so that “Yes” is the first
level.
Make sure you load the tidyverse and
tidymodels!
Remember that you’ll need to set a seed at the beginning of the document to reproduce your results.
library(tidyverse)
library(tidymodels)
library(rlang)
library(corrr)
library(klaR)
library(discrim)
library(poissonreg)
#library(ISLR) # For the Smarket data set
#library(ISLR2) # For the Bikeshare data set
tidymodels_prefer()
set.seed(22)
# Load Data
rawData <- read.csv("data/titanic.csv")
head(rawData)
## passenger_id survived pclass
## 1 1 No 3
## 2 2 Yes 1
## 3 3 Yes 3
## 4 4 Yes 1
## 5 5 No 3
## 6 6 No 3
## name sex age sib_sp parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## ticket fare cabin embarked
## 1 A/5 21171 7.2500 <NA> S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 <NA> S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 <NA> S
## 6 330877 8.4583 <NA> Q
# Copy dataframe
data <- duplicate(rawData, shallow = FALSE)
#reorder factors
data$survived <- factor(data$survived, levels = c("Yes", "No"))
data$pclass <- factor(data$pclass)
levels(data$survived)
## [1] "Yes" "No"
levels(data$pclass)
## [1] "1" "2" "3"
Split the data, stratifying on the outcome variable,
survived. You should choose the proportions to split the
data into. Verify that the training and testing data sets have the
appropriate number of observations. Take a look at the training data and
note any potential issues, such as missing data.
# Split the data for training/testing
dataSplit <- data %>%
initial_split(prop = 0.8,
strata = survived)
dataTrain <- training(dataSplit)
dataTest <- testing(dataSplit)
#Check dimensions
dim(dataTest)
## [1] 179 12
dim(dataTrain)
## [1] 712 12
head(data)
## passenger_id survived pclass
## 1 1 No 3
## 2 2 Yes 1
## 3 3 Yes 3
## 4 4 Yes 1
## 5 5 No 3
## 6 6 No 3
## name sex age sib_sp parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## ticket fare cabin embarked
## 1 A/5 21171 7.2500 <NA> S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 <NA> S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 <NA> S
## 6 330877 8.4583 <NA> Q
# Return number of survived and did not
count(data, survived)
## survived n
## 1 Yes 342
## 2 No 549
#look at the training data, check for missing values
na_count <-sapply(data, function(y) sum(is.na(y)))
na_count
## passenger_id survived pclass name sex age
## 0 0 0 0 0 177
## sib_sp parch ticket fare cabin embarked
## 0 0 0 0 687 2
Why is it a good idea to use stratified sampling for this data?
We want to use stratified sampling for this data because there is an uneven proportion of passengers who survived and who died. Thus we want the models we train to train on the correct proportion of survivors and non-survivors relative to the total dataset.
Using the training data set, explore/describe the
distribution of the outcome variable survived.
counts <- data.frame(count(dataTrain, survived))
counts
## survived n
## 1 Yes 273
## 2 No 439
sum(counts$n)
## [1] 712
counts %>% mutate(percentage = n/sum(n))
## survived n percentage
## 1 Yes 273 0.383427
## 2 No 439 0.616573
data %>%
ggplot(aes(x = survived)) +
geom_bar()
Out of 712 recorded passengers on the Titanic, 439 or 61.6% did not survive, while 273 (38.3%) did survive.
Using the training data set, create a correlation matrix of all continuous variables. Create a visualization of the matrix, and describe any patterns you see. Are any predictors correlated with each other? Which ones, and in which direction?
We see the pairs (age, sib_sp) and (age, parch) slightly negatively correlated, and (sib_sp, parch), (fare, sib_sp), and (fare,,parch) slightly positively correlated.
names(dataTrain)
## [1] "passenger_id" "survived" "pclass" "name" "sex"
## [6] "age" "sib_sp" "parch" "ticket" "fare"
## [11] "cabin" "embarked"
cor_data <- data %>%
select(-c(survived, pclass, name, sex, ticket, cabin, embarked)) %>%
correlate()
rplot(cor_data)
Using the training data, create a recipe predicting
the outcome variable survived. Include the following
predictors: ticket class, sex, age, number of siblings or spouses
aboard, number of parents or children aboard, and passenger fare.
Recall that there were missing values for age. To deal
with this, add an imputation step using
step_impute_linear(). Next, use step_dummy()
to dummy encode categorical predictors. Finally,
include interactions between:
You’ll need to investigate the tidymodels documentation
to find the appropriate step functions to use.
names(dataTrain)
## [1] "passenger_id" "survived" "pclass" "name" "sex"
## [6] "age" "sib_sp" "parch" "ticket" "fare"
## [11] "cabin" "embarked"
titanicRecipe <- recipe(survived ~ pclass +
sex +
age +
sib_sp +
parch +
fare, data = dataTrain) %>%
step_impute_linear(age) %>%
step_dummy(all_nominal_predictors()) %>%
step_interact(terms = ~ sex:fare) %>%
step_interact(terms = ~ age:fare)
Specify a logistic regression model for
classification using the "glm" engine. Then create a
workflow. Add your model and the appropriate recipe. Finally, use
fit() to apply your workflow to the
training data.
Hint: Make sure to store the results of
fit(). You’ll need them later on.
log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_wkflow <- workflow() %>%
add_model(log_reg) %>%
add_recipe(titanicRecipe)
log_fit <- fit(log_wkflow, dataTrain)
Repeat Question 5, but this time specify a linear
discriminant analysis model for classification using the
"MASS" engine.
lda_mod <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
lda_wkflow <- workflow() %>%
add_model(lda_mod) %>%
add_recipe(titanicRecipe)
lda_fit <- fit(lda_wkflow, dataTrain)
Repeat Question 5, but this time specify a quadratic
discriminant analysis model for classification using the
"MASS" engine.
qda_mod <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
qda_wkflow <- workflow() %>%
add_model(qda_mod) %>%
add_recipe(titanicRecipe)
qda_fit <- fit(qda_wkflow, dataTrain)
Repeat Question 5, but this time specify a naive
Bayes model for classification using the "klaR" engine. Set
the usekernel argument to FALSE.
nb_mod <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("klaR") %>%
set_args(usekernel = FALSE)
nb_wkflow <- workflow() %>%
add_model(nb_mod) %>%
add_recipe(titanicRecipe)
nb_fit <- fit(nb_wkflow, dataTrain)
nb_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: naive_Bayes()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 4 Recipe Steps
##
## • step_impute_linear()
## • step_dummy()
## • step_interact()
## • step_interact()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## $apriori
## grouping
## Yes No
## 0.383427 0.616573
##
## $tables
## $tables$age
## [,1] [,2]
## Yes 28.07076 13.41796
## No 29.17020 13.30665
##
## $tables$sib_sp
## [,1] [,2]
## Yes 0.4798535 0.7177866
## No 0.5717540 1.3340378
##
## $tables$parch
## [,1] [,2]
## Yes 0.4505495 0.7707483
## No 0.3234624 0.7852352
##
## $tables$fare
## [,1] [,2]
## Yes 49.94506 70.92922
## No 22.57262 33.23062
##
## $tables$pclass_X2
## [,1] [,2]
## Yes 0.2490842 0.4332770
## No 0.1822323 0.3864763
##
## $tables$pclass_X3
## [,1] [,2]
## Yes 0.3589744 0.4805807
## No 0.6810934 0.4665845
##
## $tables$sex_male
## [,1] [,2]
## Yes 0.3260073 0.4696109
## No 0.8451025 0.3622197
##
## $tables$age_x_fare
## [,1] [,2]
## Yes 1598.007 2602.336
## No 682.256 1321.477
##
##
## $levels
## [1] "Yes" "No"
##
##
## ...
## and 727 more lines.
Now you’ve fit four different models to your training data.
Use predict() and bind_cols() to generate
predictions using each of these 4 models and your
training data. Then use the accuracy metric to
assess the performance of each of the four models.
Used augment() instead of bind_cols()
Which model achieved the highest accuracy on the training data?
The logistic regression achieved the highest accuracy on the training data, with an accuracy of 80.6%. It was closely followed by the quadratic discriminant analysis model which had an accuracy of 80.4%.
predict(log_fit, new_data = dataTrain, type = "prob")
## # A tibble: 712 × 2
## .pred_Yes .pred_No
## <dbl> <dbl>
## 1 0.0901 0.910
## 2 0.0649 0.935
## 3 0.0956 0.904
## 4 0.315 0.685
## 5 0.143 0.857
## 6 0.780 0.220
## 7 0.0699 0.930
## 8 0.469 0.531
## 9 0.203 0.797
## 10 0.564 0.436
## # … with 702 more rows
predict(lda_fit, new_data = dataTrain, type = "prob")
## # A tibble: 712 × 2
## .pred_Yes .pred_No
## <dbl> <dbl>
## 1 0.0611 0.939
## 2 0.0434 0.957
## 3 0.0635 0.936
## 4 0.257 0.743
## 5 0.0948 0.905
## 6 0.817 0.183
## 7 0.0513 0.949
## 8 0.549 0.451
## 9 0.151 0.849
## 10 0.644 0.356
## # … with 702 more rows
predict(qda_fit, new_data = dataTrain, type = "prob")
## # A tibble: 712 × 2
## .pred_Yes .pred_No
## <dbl> <dbl>
## 1 0.0264 0.974
## 2 0.0175 0.982
## 3 0.0283 0.972
## 4 0.215 0.785
## 5 0.0461 0.954
## 6 0.598 0.402
## 7 0.00000249 1.00
## 8 0.542 0.458
## 9 0.0351 0.965
## 10 0.00375 0.996
## # … with 702 more rows
predict(nb_fit, new_data = dataTrain, type = "prob")
## # A tibble: 712 × 2
## .pred_Yes .pred_No
## <dbl> <dbl>
## 1 0.0233 0.977
## 2 0.0230 0.977
## 3 0.0240 0.976
## 4 0.339 0.661
## 5 0.0254 0.975
## 6 0.455 0.545
## 7 0.00000729 1.00
## 8 0.403 0.597
## 9 0.194 0.806
## 10 0.0132 0.987
## # … with 702 more rows
# assess accuracy of logistic regression model
augment(log_fit, new_data = dataTrain) %>%
conf_mat(truth = survived, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 196 61
## No 77 378
log_acc <- augment(log_fit, new_data = dataTrain) %>%
accuracy(truth = survived, estimate = .pred_class)
log_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.806
# assess accuracy of lda model
augment(lda_fit, new_data = dataTrain) %>%
conf_mat(truth = survived, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 190 67
## No 83 372
lda_acc <- augment(lda_fit, new_data = dataTrain) %>%
accuracy(truth = survived, estimate = .pred_class)
lda_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.789
# assess accuracy of qda model
augment(qda_fit, new_data = dataTrain) %>%
conf_mat(truth = survived, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 197 63
## No 76 376
qda_acc <- augment(qda_fit, new_data = dataTrain) %>%
accuracy(truth = survived, estimate = .pred_class)
qda_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.805
# assess accuracy of naive bayes model
augment(nb_fit, new_data = dataTrain) %>%
conf_mat(truth = survived, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 144 37
## No 129 402
nb_acc <- augment(nb_fit, new_data = dataTrain) %>%
accuracy(truth = survived, estimate = .pred_class)
nb_acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.767
Fit the model with the highest training accuracy to the testing data. Report the accuracy of the model on the testing data.
The model predicts the survival of persons on the titanic in the testing dataset with an accuracy of 84.9%.
multiMetric <- metric_set(accuracy, sensitivity, specificity)
augment(log_fit, new_data = dataTest) %>%
multiMetric(truth = survived, estimate = .pred_class)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.849
## 2 sensitivity binary 0.739
## 3 specificity binary 0.918
Again using the testing data, create a confusion matrix and visualize it. Plot an ROC curve and calculate the area under it (AUC).
augment(log_fit, new_data = dataTest) %>%
conf_mat(truth = survived, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 51 9
## No 18 101
augment(log_fit, new_data = dataTest) %>%
roc_curve(survived, .pred_Yes) %>%
autoplot
augment(log_fit, new_data = dataTest) %>%
roc_auc(survived, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.857
How did the model perform? Compare its training and testing accuracies. If the values differ, why do you think this is so?
*The model performed pretty well. Its training accuracy was 80.6%, and its testing accuracy was 84.9%. This is counterintuitive, because I would believe that the model would be more likely to better describe the training data. The difference may be due to random chance, or perhaps passengers in the test data set were more clear-cut as likely to be survivors than those in the training set.
In a binary classification problem, let \(p\) represent the probability of class label \(1\), which implies that \(1 - p\) represents the probability of class label \(0\). The logistic function (also called the “inverse logit”) is the cumulative distribution function of the logistic distribution, which maps a real number z to the open interval \((0, 1)\).
Given that:
\[ p(z)=\frac{e^z}{1+e^z} \]
Prove that the inverse of a logistic function is indeed the logit function:
\[ z(p)=ln\left(\frac{p}{1-p}\right) \]
local_image
Assume that \(z = \beta_0 + \beta_{1}x_{1}\) and \(p = logistic(z)\). How do the odds of the outcome change if you increase \(x_{1}\) by two? Demonstrate this.
Assume now that \(\beta_1\) is negative. What value does \(p\) approach as \(x_{1}\) approaches \(\infty\)? What value does \(p\) approach as \(x_{1}\) approaches \(-\infty\)?
local_image